iT邦幫忙

2023 iThome 鐵人賽

DAY 28

Self-Challenge Group (自我挑戰組)

Learning Deep Learning & ASR Chinese Speech Recognition series, Day 28

【Day 28】Happy Fine-tuning Time with the Whisper Model - 3

Did you think I was going to explain how to convert your own data into the format Datasets expects? Nope. Preparing data is such a hassle,
so I'll just keep going.
Here we can use a variable to hold which base model we want to fine-tune; I'm going with the Whisper small version:
selected_model = "openai/whisper-small"
Some people might ask: why not use medium?
Because not everyone's GPU is that powerful. I'll show what happens if you use medium in the next post.
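
If you're not sure what your GPU can handle, here's a minimal sketch (my own addition; the rule of thumb in the comment is a rough assumption, not an official requirement) for checking available VRAM before picking a size:

import torch

# check whether a CUDA GPU is available and how much memory it has
if torch.cuda.is_available():
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"{torch.cuda.get_device_name(0)}: {vram_gb:.1f} GB VRAM")
    # rough assumption: small is workable around 8 GB for fine-tuning,
    # medium wants noticeably more
else:
    print("No CUDA GPU found; fine-tuning on CPU will be very slow")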

Import

Next up is feature extraction, which uses the FeatureExtractor.
We can reuse the selected_model variable from above; if you want a different model, change it there.

from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained(selected_model)
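
If you're curious what the feature extractor actually produces, here's a small sketch with dummy audio (my own example; the shapes are those used by whisper-small):

import numpy as np

# one second of silence at 16 kHz, just to inspect the output
dummy_audio = np.zeros(16000, dtype=np.float32)
features = feature_extractor(dummy_audio, sampling_rate=16000).input_features[0]
print(features.shape)  # (80, 3000): 80 log-Mel bins x 3000 frames (padded to 30 s)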

Next is the Tokenizer, again loaded from the pre-trained checkpoint:

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained(selected_model, language="chinese", task="transcribe")
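
A quick way to sanity-check the tokenizer is an encode/decode round trip (a minimal sketch; the sample sentence is just a placeholder):

input_str = "今天天氣很好"  # any Chinese sentence works here
labels = tokenizer(input_str).input_ids
decoded_with_special = tokenizer.decode(labels, skip_special_tokens=False)
decoded_str = tokenizer.decode(labels, skip_special_tokens=True)

print(f"Input:                 {input_str}")
print(f"Decoded w/ special:    {decoded_with_special}")
print(f"Decoded w/out special: {decoded_str}")
print(f"Round trip OK:         {input_str == decoded_str}")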

Here we add one more piece, the Processor:

from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(selected_model, language="chinese", task="transcribe")
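
The Processor doesn't add anything new; it just bundles the two objects above so you only need to pass one thing around. A quick check (a minimal sketch):

# the processor exposes the same two components we built above
print(type(processor.feature_extractor).__name__)  # WhisperFeatureExtractor
print(type(processor.tokenizer).__name__)          # WhisperTokenizer (or the fast variant)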

Prepare Data

Then convert the sampling rate of all audio files to 16 kHz, which is what Whisper expects:

from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

print(common_voice["train"][0])

The printed sampling_rate field should now be 16000.

Next, a bit of light preprocessing on the Dataset:

def prepare_dataset(batch):
    # load and resample audio data from 48 kHz to 16 kHz
    audio = batch["audio"]

    # compute log-Mel input features from the input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=1)

One thing to watch here: the original guide used num_proc=4. If that run errors out, or your machine isn't that powerful, dropping it to 1 may work better.
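
Mapping over the whole dataset also takes a while, so it can be worth testing the pipeline on a handful of examples first. A minimal sketch to run instead of the full map while debugging:

# try the preprocessing on just a few samples to catch errors early
sample = common_voice["train"].select(range(4)).map(
    prepare_dataset, remove_columns=common_voice.column_names["train"]
)
print(len(sample[0]["input_features"]))  # 80 log-Mel bins
print(sample[0]["labels"][:10])          # first few label ids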

Data Collator

The collator handles batching: audio inputs and text labels have different lengths and need different padding methods, so it pads them separately and masks the label padding with -100 so it's ignored by the loss.

import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's appended later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
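
To see the collator in action, here's a small usage sketch (my own addition, assuming common_voice has already gone through prepare_dataset above):

# batch two processed examples and inspect the padded tensors
example_batch = data_collator([common_voice["train"][0], common_voice["train"][1]])
print(example_batch["input_features"].shape)  # (2, 80, 3000)
print(example_batch["labels"].shape)          # (2, longest label length), padded with -100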

That's it for now!


Quick thoughts

Coming up to Taipei is exhausting. Three days left!

